
Collaborating Authors: fuzzy clustering


FCPCA: Fuzzy clustering of high-dimensional time series based on common principal component analysis

Ma, Ziling, López-Oriona, Ángel, Ombao, Hernando, Sun, Ying

arXiv.org Machine Learning

Clustering multivariate time series is a crucial task in many domains, as it enables the identification of meaningful patterns and groups in time-evolving data. Traditional approaches, such as crisp clustering, rely on the assumption that clusters are well separated with little overlap. However, real-world data often defy this assumption, exhibiting overlapping clouds of points and blurred boundaries between clusters. Fuzzy clustering offers a compelling alternative by allowing partial membership in multiple clusters, making it well suited to these ambiguous scenarios. Despite its advantages, current fuzzy clustering methods focus primarily on univariate time series, and in the multivariate case even datasets of moderate dimensionality become computationally prohibitive. This challenge is further exacerbated when the time series have varying lengths, leaving a clear gap in addressing the complexities of modern datasets. This work introduces a novel fuzzy clustering approach based on common principal component analysis that addresses these shortcomings. Our method efficiently handles high-dimensional multivariate time series by reducing dimensionality while preserving critical temporal features. Extensive numerical results show that the proposed method outperforms several existing approaches in the literature, and an application to brain signals from different drivers, recorded in a simulated driving experiment, illustrates its potential.
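The paper's exact CPCA estimator and clustering objective are not reproduced in the abstract; as an illustrative sketch of the general recipe only, one hypothetical pipeline pools per-series covariance structure to estimate shared components, summarizes each series along them, and soft-assigns the summaries with standard fuzzy c-means. The function names, the averaged-covariance proxy, and the variance-based summary below are assumptions, not the authors' method:

```python
import numpy as np

def common_components(series_list, q):
    """Approximate common principal components by eigendecomposing the
    average of the per-series covariance matrices (an illustrative proxy;
    the paper's CPCA estimator may differ)."""
    covs = [np.cov(X, rowvar=False) for X in series_list]   # each X: (T, d)
    avg_cov = np.mean(covs, axis=0)
    vals, vecs = np.linalg.eigh(avg_cov)                    # ascending order
    return vecs[:, np.argsort(vals)[::-1][:q]]              # (d, q) leading axes

def fuzzy_cmeans(F, k, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means (Bezdek) on feature vectors F: (n, p)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(F))              # memberships (n, k)
    for _ in range(iters):
        W = U ** m
        C = W.T @ F / W.sum(axis=0)[:, None]                # weighted centres
        D = np.linalg.norm(F[:, None] - C[None], axis=2) + 1e-12
        U = D ** (-2 / (m - 1)) / np.sum(D ** (-2 / (m - 1)),
                                         axis=1, keepdims=True)
    return U, C

# toy usage: 6 three-variate series of varying length
rng = np.random.default_rng(1)
series = [rng.standard_normal((rng.integers(80, 120), 3)) for _ in range(6)]
V = common_components(series, q=2)
feats = np.array([(X @ V).var(axis=0) for X in series])    # variance on shared axes
U, _ = fuzzy_cmeans(feats, k=2)
```

Rows of U sum to one, so each series receives a degree of membership in every cluster rather than a single hard label, which is the fuzzy behaviour the abstract describes.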


Fuzzy Clustering with Similarity Queries

Neural Information Processing Systems

The fuzzy or soft k-means objective is a popular generalization of the well-known k-means problem, extending the clustering capability of k-means to datasets that are uncertain, vague, or otherwise hard to cluster. In this paper, we propose a semi-supervised active clustering framework in which the learner is allowed to interact with an oracle (a domain expert), asking for the similarity between certain chosen items. We study the query and computational complexity of clustering in this framework. We prove that a few such similarity queries suffice to obtain a polynomial-time approximation algorithm for an otherwise conjecturally NP-hard problem. In particular, we provide algorithms for fuzzy clustering in this setting that ask O(poly(k) log n) similarity queries and run in polynomial time, where n is the number of items.
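For orientation, the soft k-means objective referred to above is, in its classical (Bezdek) form with fuzziness exponent m > 1 (the paper may analyze a particular variant):

```latex
\min_{U,\, c_1,\dots,c_k} \;\; \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{\,m}\, \lVert x_i - c_j \rVert^2
\qquad \text{subject to} \qquad \sum_{j=1}^{k} u_{ij} = 1, \;\; u_{ij} \ge 0,
```

where u_{ij} is the degree of membership of item x_i in cluster j; taking m to 1 recovers hard k-means.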


Fuzzy Clustering by Hyperbolic Smoothing

Masis, David, Segura, Esteban, Trejos, Javier, Xavier, Adilson

arXiv.org Machine Learning

We propose a novel method for building fuzzy clusters of large data sets using a smoothing numerical approach. The usual sum-of-squares criterion is relaxed so that the search for good fuzzy partitions is carried out over a continuous space, rather than the combinatorial space of classical methods \cite{Hartigan}. The smoothing converts a strongly non-differentiable problem into low-dimensional, unconstrained, differentiable subproblems by using an infinitely differentiable function. The algorithm was implemented in the statistical software R, and the results obtained were compared with the traditional fuzzy C-means method proposed by Bezdek.
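The paper's specific smoothing construction is not reproduced in the abstract, but hyperbolic smoothing is commonly illustrated by replacing the non-differentiable max(y, 0) with an infinitely differentiable surrogate controlled by a parameter tau > 0. The function below is that standard textbook surrogate, offered as a minimal sketch rather than the authors' exact construction:

```python
import numpy as np

def hyp_smooth_max(y, tau):
    """Smooth surrogate for max(y, 0): infinitely differentiable for
    tau > 0, and the approximation error is at most tau / 2
    (largest at y = 0, where it returns tau / 2 instead of 0)."""
    return (y + np.sqrt(y ** 2 + tau ** 2)) / 2.0

# the approximation tightens as tau shrinks
for tau in (1.0, 0.1, 0.001):
    err = abs(hyp_smooth_max(-3.0, tau) - max(-3.0, 0.0))
    print(tau, err)
```

Substituting such surrogates for the min/max operations inside a clustering criterion is what turns the combinatorial search into the smooth, unconstrained subproblems the abstract mentions.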


Fuzzy Clustering Using HDBSCAN

#artificialintelligence

Like most undergraduates fresh out of college, with little to no first-hand experience on industry ML projects and loads of ML/Python certifications, I joined the Business Intelligence team at Samsung. There were three new hires on the team and only one Data Scientist (DS) position available; the other two were Data Engineering. With all three of us riding the ML wave, we all sought the Data Scientist position. During the first meeting with our manager, you can imagine the amount of malarkey the candidates spat out to get the position. We were given a three-week trial period during which each of us had to build a data engineering pipeline and perform an exploratory data analysis on a given dataset.


Application of Fuzzy Clustering for Text Data Dimensionality Reduction

Karami, Amir

arXiv.org Machine Learning

Large textual corpora are often represented by the document-term frequency matrix, whose elements are term frequencies; however, this matrix suffers from two problems: sparsity and high dimensionality. Four dimension reduction strategies are used to address these problems. Of the four, unsupervised feature transformation (UFT) is a popular and efficient strategy for mapping the terms to a new basis in the document-term frequency matrix. Although several UFT-based methods have been developed, fuzzy clustering has not been considered for dimensionality reduction. This research explores fuzzy clustering as a new UFT-based approach to creating a lower-dimensional representation of documents. The performance of fuzzy clustering, with and without global term weighting, is shown to exceed that of principal component analysis and singular value decomposition. The study also explores the effect of different fuzzifier values on fuzzy clustering for dimensionality reduction purposes.
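The UFT idea can be sketched concretely: run fuzzy c-means on the n x V document-term frequency matrix and use the resulting n x k membership matrix itself as the lower-dimensional document representation. The toy matrix and parameter choices below are illustrative assumptions, not the paper's corpus or setup:

```python
import numpy as np

def fuzzy_cmeans(X, k, m=2.0, iters=200, seed=0):
    """Plain fuzzy c-means; the membership matrix U doubles as a
    k-dimensional representation of the rows of X."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))             # (n, k) memberships
    for _ in range(iters):
        W = U ** m
        C = W.T @ X / W.sum(axis=0)[:, None]               # (k, V) centres
        D = np.linalg.norm(X[:, None] - C[None], axis=2) + 1e-12
        U = D ** (-2 / (m - 1)) / np.sum(D ** (-2 / (m - 1)),
                                         axis=1, keepdims=True)
    return U

# toy document-term frequency matrix: 6 documents, 8 terms, two topics
tf = np.array([
    [5, 3, 0, 0, 1, 0, 0, 0],
    [4, 4, 1, 0, 0, 0, 0, 0],
    [6, 2, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 5, 4, 3, 0, 1],
    [0, 1, 0, 4, 5, 4, 0, 0],
    [0, 0, 1, 3, 5, 5, 1, 0],
], dtype=float)
U = fuzzy_cmeans(tf, k=2)     # 8-dimensional term space -> 2-dimensional membership space
print(U.round(2))
```

Each document is now described by k membership degrees instead of V term frequencies, which is dense and low-dimensional, addressing both problems the abstract names; global term weighting would simply be applied to tf before clustering.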


Generalizing k-means for an arbitrary distance matrix

Szalkai, Balázs

arXiv.org Machine Learning

The original k-means method works only if the exact vectors representing the data points are known: computing distances to the centroids requires vector operations, since the average of abstract data points is undefined. Existing algorithms can be extended to cases where the sole input is the distance matrix and the exact representing vectors are unknown. This extension may be named relational k-means, after the notation for a similar algorithm invented for fuzzy clustering. A method is then proposed for generalizing k-means to scenarios where the data points have no connection with a Euclidean space at all.
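A minimal sketch of the relational trick: when the squared distance matrix D2 comes from some (unknown) Euclidean embedding, the squared distance from point i to the implicit centroid of a cluster C can be computed from pairwise distances alone, via d^2(i, c_C) = (1/|C|) * sum_{j in C} D2[i, j] - (1/(2|C|^2)) * sum_{j, l in C} D2[j, l]. The code below is an illustrative implementation of that identity, not the paper's further generalization beyond Euclidean data:

```python
import numpy as np

def relational_kmeans(D2, k, iters=50, seed=0):
    """k-means given only squared pairwise distances D2: (n, n).
    Cluster centroids are never formed explicitly; distances to them
    are recovered from D2 (exact when D2 is Euclidean)."""
    n = len(D2)
    rng = np.random.default_rng(seed)
    seeds = rng.choice(n, size=k, replace=False)
    labels = D2[:, seeds].argmin(axis=1)          # init: nearest seed point
    for _ in range(iters):
        cost = np.empty((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:                     # empty cluster: never chosen
                cost[:, c] = np.inf
                continue
            within = D2[np.ix_(idx, idx)].mean() / 2.0
            cost[:, c] = D2[:, idx].mean(axis=1) - within
        new = cost.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# toy: two well-separated point clouds, but hand the algorithm only distances
rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(10.0, 0.1, (5, 2))])
D2 = ((pts[:, None] - pts[None]) ** 2).sum(axis=2)
labels = relational_kmeans(D2, k=2)
print(labels)
```

Because the update touches only rows and columns of D2, the same loop runs unchanged for any dissimilarity matrix; the identity is merely no longer exact when the dissimilarities are not Euclidean, which is the gap the paper's generalization targets.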